A TEXT-TO-SPEECH CONVERSION SYSTEM FOR INTERLOCKING WITH
MULTIMEDIA AND A METHOD FOR ORGANIZING INPUT DATA OF THE SAME



BACKGROUND OF THE INVENTION 



Field of the Invention 



The present invention relates to a text-to-speech conversion
system (hereinafter referred to as TTS) for interlocking with
multimedia and a method for organizing input data of the same,
and more particularly to a TTS and an input data organization
method that enhance the naturalness of synthesized speech and
accomplish synchronization between multimedia and the TTS by
defining additional prosody information, the information
required to interlock the TTS with multimedia, and an interface
between this information and the TTS for use in producing the
synthesized speech.



Description of the Related Art 



Generally, the function of a speech synthesizer is to provide
information in different forms to a person using a computer. To
this end, the speech synthesizer should provide the user with
high-quality synthesized speech from a given text. In addition,
to interlock with a database produced in a multimedia
environment, such as moving pictures or animation, or with a
variety of media provided by a counterpart in a conversation, the
speech synthesizer should produce synthesized speech that is
synchronized with these media. In particular, synchronization of
the TTS with multimedia is essential to providing the user with
high-quality service.

As shown in FIG. 1, a conventional TTS typically goes through
the following three steps before synthesized speech is produced
from an inputted text.

In a first step, a language processor 1 converts the text
into a series of phonemes, estimates prosody information, and
symbolizes this information. The prosody symbols are estimated
from phrase and paragraph boundaries, the location of the accent
in each word, the sentence pattern, and so on, using the result
of syntactic analysis.

In a second step, a prosody processor 2 calculates the values
of the prosody control parameters from the symbolized prosody
information using rules and tables. The prosody control
parameters include phoneme duration, pitch contour, energy
contour, and pause interval information.

In a third step, a signal processor 3 produces synthesized
speech using a synthesis unit database 4 and the prosody control
parameters. In other words, the conventional TTS must estimate
the information associated with naturalness and speech rate in
the language processor 1 and the prosody processor 2 from the
inputted text alone.

Further, the conventional TTS has only the simple function of
outputting data inputted sentence by sentence as synthesized
speech. Accordingly, in order to output sentences stored in a
file or sentences inputted through a communication network as
synthesized speech in succession, a main control program that
reads sentences from the inputted data and transmits them to the
input of the TTS is required. Such main control programs include
a method of separating the text from the inputted data and then
outputting the synthesized speech once from beginning to end, a
method of producing the synthesized speech in interlock with a
text editor, a method of looking up sentences through a graphic
interface and producing the synthesized speech, and so on, but
the object to which these methods are applicable is restricted
to text.

At present, studies on TTS have advanced considerably for the
native languages of different countries, and commercial use has
been achieved in some countries. However, such use is limited to
synthesizing speech from inputted text. In addition, with a
prior organization, it is impossible to estimate from the text
alone the information required when a moving picture is to be
dubbed by use of a TTS or when a natural interlock between the
synthesized speech and multimedia such as animation is to be
implemented, so there is no method of realizing these functions.
Furthermore, there are no results of studies on the use of
additional data for enhancing the naturalness of synthesized
speech or on the organization of such data.

SUMMARY OF THE INVENTION 

Therefore, it is an object of the present invention to
provide a text-to-speech conversion system (TTS) for interlocking
with multimedia and a method for organizing input data of the
same that enhance the naturalness of synthesized speech and
accomplish synchronization of multimedia with the TTS by defining
additional prosody information, the information required to
interlock the TTS with multimedia, and an interface between this
information and the TTS for use in producing the synthesized
speech.

In order to accomplish the above object, a TTS for
interlocking with multimedia according to the present invention
comprises: a multimedia information input unit for organizing
text, prosody, information on synchronization with a moving
picture, lip-shape, and information such as individual property;
a data distributor by each media for distributing the
information of the multimedia information input unit into
information for each media; a language processor for converting
the text distributed by the data distributor by each media into
a phoneme stream, estimating prosody information, and
symbolizing the information; a prosody processor for calculating
values of prosody control parameters from the symbolized prosody
information using rules and tables; a synchronization adjuster
for adjusting the duration of each phoneme using the
synchronization information distributed by the data distributor
by each media; a signal processor for producing synthesized
speech using the prosody control parameters and data in a
synthesis unit database; and a picture output apparatus for
outputting the picture information distributed by the data
distributor by each media onto a screen.

In order to accomplish the above object, a method for
organizing input data of a text-to-speech conversion system (TTS)
for interlocking with multimedia comprises the steps of:
classifying, in a multimedia information input unit, multimedia
input information organized for enhancing the naturalness of
synthesized speech and implementing synchronization of
multimedia with the TTS into text, prosody, information on
synchronization with a moving picture, lip-shape, and individual
property information; distributing, in a data distributor by
each media, the information classified in the multimedia
information input unit according to the respective information;
converting, in a language processor, the text distributed by the
data distributor by each media into a phoneme stream, estimating
prosody information, and symbolizing the information;
calculating, in a prosody processor, values of prosody control
parameters other than those included in the multimedia
information; adjusting, in a synchronization adjuster, the
duration of each phoneme so that the processing result of the
prosody processor is synchronized with a picture signal
according to the inputted synchronization information; producing
the synchronized speech in a signal processor using the prosody
information from the data distributor by each media, the
processing result of the synchronization adjuster, and a
synthesis unit database; and outputting, in a picture output
apparatus, the picture information distributed by the data
distributor by each media onto a screen.

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing and other objects, features, and aspects of the
present invention will become more apparent from the following
detailed description of the present invention when taken in
conjunction with the accompanying drawings.

FIG. 1 is a constructional view of a conventional text-to- 
speech conversion system. 

FIG. 2 is a constructional view of the hardware to which the
present invention is applied.

FIG. 3 is a constructional view of a text-to-speech 
conversion system according to the present invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 

Now, the present invention will be described in detail by 
way of the preferred embodiment. 

Referring to FIG. 2, a constructional view of the hardware to
which the present invention is applied is shown. In FIG. 2, the
hardware consists of a multimedia data input unit 5, a central
processing unit 6, a synthesis database 7, a digital-to-analog
(D/A) converter 8, and a picture output apparatus 9.

The multimedia data input unit 5 receives data composed of
multimedia such as pictures and text and outputs this data to
the central processing unit 6.

The central processing unit 6 distributes the multimedia
input data of the present invention, adjusts synchronization, and
executes the algorithm for producing synthesized speech.

The synthesis database 7 is a database used by the algorithm
for producing the synthesized speech. The synthesis database 7
is stored in a storage device and transmits the necessary data
to the central processing unit 6.

The digital-to-analog (D/A) converter 8 converts the
synthesized digital data into an analog signal and outputs the
analog signal.

The picture output apparatus 9 outputs inputted picture 
information onto a screen. 

Tables 1 and 2 show the syntax of the organized multimedia
input information, which consists of text, prosody, information
on synchronization with a moving picture, lip-shape, and
individual property information.

[Table 1]

Syntax

TTS_Sequence() {
    TTS_Sequence_Start_Code
    Prosody_Enable
    Video_Enable
    Lip_Shape_Enable
    Start_Any_Place
    do {
        TTS_Sentence()
    } while (next_bits() == TTS_Sentence_Start_Code)
}

Here, the TTS_Sequence_Start_Code is a bit string
represented in hexadecimal as 'XXXXX' and marks the start of a
TTS sequence.

The TTS_Sentence_ID is a 10-bit ID and represents the unique
number of each TTS data stream.

The Language_Code represents the target language to be
synthesized, such as Korean, English, German, Japanese, or
French.

The Prosody_Enable is a 1-bit flag and has a value of '1'
when prosody data of the original sound is included in the
organized data.

The Video_Enable is a 1-bit flag and has a value of '1' when
the TTS is interlocked with a moving picture.

The Lip_Shape_Enable is a 1-bit flag and has a value of '1'
when lip-shape data is included in the organized data.

The Trick_Mode_Enable is a 1-bit flag and has a value of '1'
when the data is organized to support trick modes such as stop,
restart, forward, and backward.
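
The sequence-level flags described above can be pictured as a small record. The following is a minimal sketch in Python, not the format of the invention itself: the class name `TTSSequenceHeader`, the helper `optional_blocks`, and the block labels are hypothetical, and no bit layout beyond the stated 1-bit flags is assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TTSSequenceHeader:
    """Sequence-level flags named in the description (class name is hypothetical)."""
    prosody_enable: bool     # '1' when prosody data of the original sound is included
    video_enable: bool       # '1' when the TTS is interlocked with a moving picture
    lip_shape_enable: bool   # '1' when lip-shape data is included
    trick_mode_enable: bool  # '1' when stop/restart/forward/backward are supported


def optional_blocks(header: TTSSequenceHeader) -> List[str]:
    """Name the optional data blocks announced by the flags (labels are illustrative)."""
    blocks = []
    if header.prosody_enable:
        blocks.append("prosody")
    if header.video_enable:
        blocks.append("video_sync")
    if header.lip_shape_enable:
        blocks.append("lip_events")
    return blocks
```

A reader of the stream could consult such a record once per sequence to decide which optional parts of each sentence to expect.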



[Table 2]

Syntax

TTS_Sentence() {
    TTS_Sentence_Start_Code
    Silence
    if (Silence) {
        Silence_Duration
    }
    else {
        Gender
        Age
        if (!Video_Enable) {
            Speech_Rate
        }
        Length_of_Text
        TTS_Text
        Position_in_Sentence
        if (Prosody_Enable) {
            Number_of_phonemes
            Dur_Enable
            F0_Enable
            Energy_Enable
            for (j = 0; j < Number_of_phonemes; j++) {
                Symbol_each_phoneme
                Dur_each_phoneme
                F0_contour_each_phoneme
                Energy_contour_each_phoneme
            }
        }
        if (Video_Enable) {
            Sentence_Duration
            Position_in_Sentence
            offset
        }
        if (Lip_Shape_Enable) {
            Number_of_Lip_Event
            for (j = 0; j < Number_of_Lip_Event; j++) {
                Lip_in_Sentence
                Lip_Shape
            }
        }
    }
}

Here, the TTS_Sentence_Start_Code is a bit string
represented in hexadecimal as 'XXXXX' and marks the start of a
TTS sentence. The TTS_Sentence_Start_Code is a 10-bit ID and
represents the unique number of each TTS data stream.

The TTS_Sentence_ID is a 10-bit ID and represents the unique
number of each TTS sentence in the TTS stream.

The Silence is a 1-bit flag that becomes '1' when the present
input frame is a silent speech section.

The Silence_Duration represents the duration of the present
silent speech section in milliseconds.

The Gender specifies the gender of the synthesized speech.

The Age classifies the age of the synthesized speech as baby,
youth, middle age, or old age.

The Speech_Rate represents the speech rate of the synthesized
speech.

The Length_of_Text represents the length of the input text
sentence in bytes.

The TTS_Text represents sentence text of arbitrary length.



The Dur_Enable is a 1-bit flag that becomes '1' when duration
information is included in the organized data.

The F0_Contour_Enable is a 1-bit flag that becomes '1' when
pitch information of each phoneme is included in the organized
data.

The Energy_Contour_Enable is a 1-bit flag that becomes '1'
when energy information of each phoneme is included in the
organized data.

The Number_of_Phonemes represents the number of phonemes
needed to synthesize the sentence.

The Symbol_each_phoneme represents the symbol, such as an IPA
symbol, that denotes each phoneme.

The Dur_each_phoneme represents the duration of the phoneme.

The F0_contour_each_phoneme represents the pitch pattern of
the phoneme as the pitch values at the beginning point, mid
point, and end point of the phoneme.

The Energy_contour_each_phoneme represents the energy pattern
of the phoneme as the energy values, in decibels (dB), at the
beginning point, mid point, and end point of the phoneme.



The Sentence_Duration represents the total duration of the
synthesized speech of the sentence.

The Position_in_Sentence represents the position of the
present frame in the sentence.

The offset represents, when the synthesized speech is
interlocked with a moving picture and the beginning point of the
sentence lies within a GOP (Group Of Pictures), the delay time
from the beginning point of the GOP to the beginning point of
the sentence.

The Number_of_Lip_Event represents the number of lip-shape
changing points in the sentence.

The Lip_Shape represents the lip-shape at each lip-shape
changing point of the sentence.
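
To make the per-phoneme prosody fields concrete, the sketch below stores the three contour points of one phoneme and reads a pitch value at an arbitrary time inside it. It is only an illustration: the class `PhonemeProsody`, the function `f0_at`, the units (milliseconds, Hz), and the piecewise-linear interpolation between the three points are assumptions that the text does not specify.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PhonemeProsody:
    """Per-phoneme prosody fields of Table 2 (class name and units are assumed)."""
    symbol: str                            # Symbol_each_phoneme, e.g. an IPA symbol
    duration_ms: float                     # Dur_each_phoneme
    f0_hz: Tuple[float, float, float]      # pitch at beginning, mid and end point
    energy_db: Tuple[float, float, float]  # energy at beginning, mid and end point


def f0_at(p: PhonemeProsody, t_ms: float) -> float:
    """Pitch at time t_ms inside the phoneme, interpolated linearly (an assumption)."""
    if p.duration_ms <= 0:
        raise ValueError("phoneme duration must be positive")
    if not 0.0 <= t_ms <= p.duration_ms:
        raise ValueError("t_ms lies outside the phoneme")
    half = p.duration_ms / 2.0
    begin, mid, end = p.f0_hz
    if t_ms <= half:
        return begin + (mid - begin) * (t_ms / half)
    return mid + (end - mid) * ((t_ms - half) / half)
```

The same interpolation could be applied to the energy contour, since it is described with the same three points.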



Text information includes a classification code for the
language used and the sentence text. Prosody information
includes the number of phonemes in the sentence, phoneme stream
information, the duration of each phoneme, the pitch pattern of
each phoneme, and the energy pattern of each phoneme, and is used
to enhance the naturalness of the synthesized speech. The
synchronization of the moving picture with the synthesized
speech can be regarded as dubbing, and the synchronization can
be realized in three ways.



First, there is a method of synchronizing the moving picture
and the synthesized speech sentence by sentence, in which the
duration of the synthesized speech is adjusted using information
about the beginning points of sentences, the durations of
sentences, and the delay times of the beginning points of
sentences. The beginning point of each sentence indicates the
location of the scene within the moving picture at which output
of the synthesized speech for that sentence starts. The duration
of each sentence indicates the number of scenes over which the
synthesized speech for that sentence lasts. In addition, a
moving picture compressed with MPEG-2 or MPEG-4, which use the
Group of Pictures (GOP) concept, cannot start reproduction at an
arbitrary scene but must start at the beginning scene of a Group
of Pictures. Therefore, the delay time of the beginning point is
the information required to synchronize the Group of Pictures
and the TTS, and it indicates the delay between the beginning
scene and the speech beginning point. This method is easy to
realize and minimizes additional effort, but natural
synchronization is difficult to achieve with it.
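
The sentence-unit method can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, all times are assumed to be in milliseconds, and uniform scaling of the phoneme durations is one simple way to make a sentence fit its target duration, not necessarily the way used in the invention.

```python
from typing import List


def sentence_start_ms(gop_start_ms: float, offset_ms: float) -> float:
    """Speech starts at the beginning scene of the GOP plus the recorded delay."""
    return gop_start_ms + offset_ms


def fit_to_sentence(durations_ms: List[float], sentence_duration_ms: float) -> List[float]:
    """Scale all phoneme durations uniformly so they sum to the sentence duration."""
    total = sum(durations_ms)
    if total <= 0:
        raise ValueError("phoneme durations must sum to a positive value")
    scale = sentence_duration_ms / total
    return [d * scale for d in durations_ms]
```

Because every phoneme is stretched by the same factor, the sentence boundaries line up with the picture while the timing inside the sentence is left unchecked, which matches the limitation noted above.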



Second, there is a method in which beginning point
information, end point information, and phoneme information are
marked for each phoneme within the interval of the moving
picture associated with the speech signal, and this information
is used to produce the synthesized speech. This method has the
advantage of high accuracy, since synchronization between the
moving picture and the synthesized speech can be attained
phoneme by phoneme, but the disadvantage that considerable
additional effort is needed to detect and record the duration
information of each phoneme within the speech interval of the
moving picture.
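
For the phoneme-unit method, the duration of each phoneme follows directly from its marked beginning and end points. The sketch below assumes a hypothetical `Mark` tuple of (symbol, begin, end) in milliseconds and rejects marks that overlap or run backwards.

```python
from typing import List, Tuple

# Hypothetical record layout: (phoneme symbol, begin_ms, end_ms)
Mark = Tuple[str, float, float]


def durations_from_marks(marks: List[Mark]) -> List[Tuple[str, float]]:
    """Derive each phoneme's duration from its marked beginning and end points."""
    result = []
    previous_end = None
    for symbol, begin_ms, end_ms in marks:
        if end_ms < begin_ms:
            raise ValueError(f"end before begin for phoneme {symbol!r}")
        if previous_end is not None and begin_ms < previous_end:
            raise ValueError(f"phoneme {symbol!r} overlaps the previous one")
        result.append((symbol, end_ms - begin_ms))
        previous_end = end_ms
    return result
```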

Third, there is a method of recording the synchronization
information based on the beginning point of speech, the end point
of speech, the lip-shape, and the points in time at which the
lip-shape changes. The lip-shape is expressed numerically as the
distance between the upper lip and the lower lip (extent of
opening), the distance between the left and right end points of
the lips (extent of width), and the extent of lip protrusion, and
it is defined as a quantized and normalized pattern that depends
on the articulation location and articulation manner of the
phoneme, on the basis of patterns with high discriminative
power. This method raises the efficiency of synchronization
while minimizing the additional effort needed to produce the
synchronization information.
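
One way to picture the normalized and quantized lip-shape is shown below. The text does not give the normalization reference or the number of quantization levels, so the per-speaker maxima and the default of four levels are assumptions, and the function name `quantize_lip` is hypothetical.

```python
from typing import Tuple


def quantize_lip(opening: float, width: float, protrusion: float,
                 max_opening: float, max_width: float, max_protrusion: float,
                 levels: int = 4) -> Tuple[int, int, int]:
    """Normalize each lip measurement by its maximum and quantize it to `levels` bins."""
    def quantize(value: float, maximum: float) -> int:
        ratio = min(max(value / maximum, 0.0), 1.0)   # normalize and clamp to [0, 1]
        return min(int(ratio * levels), levels - 1)   # top of the range falls in the last bin
    return (quantize(opening, max_opening),
            quantize(width, max_width),
            quantize(protrusion, max_protrusion))
```

Each lip-shape then reduces to a small triple of integers that can be compared between the recorded synchronization information and the shapes predicted for the phonemes.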

The organized multimedia input information applied to the
present invention allows an information provider to select and
implement any of the three synchronization methods described
above.

In addition, the organized multimedia input information is
also used to implement lip animation. Lip animation can be
implemented using the phoneme stream prepared from the inputted
text in the TTS and the duration of each phoneme, using the
phoneme stream distributed from the input information and the
duration of each phoneme, or using the lip-shape information
included in the inputted information.

The individual property information allows the user to
change the gender, age, and speech rate of the synthesized
speech. Gender is male or female, and age is classified into
four groups, for example 6-7 years, 18 years, 40 years, and 65
years. The speech rate may be changed in 10 steps between 0.7
and 1.6 times a standard rate. The quality of the synthesized
speech can be diversified using this information.
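
The ten speech-rate steps can be enumerated as follows. The text states only the range and the number of steps, so the even spacing is an assumption, as are the function names and the choice to apply a rate by dividing phoneme durations by it.

```python
from typing import List


def speech_rate_steps(low: float = 0.7, high: float = 1.6, steps: int = 10) -> List[float]:
    """Evenly spaced rate factors relative to the standard rate (spacing is assumed)."""
    step = (high - low) / (steps - 1)
    return [round(low + i * step, 2) for i in range(steps)]


def apply_rate(durations_ms: List[float], rate: float) -> List[float]:
    """A faster rate shortens every phoneme duration by the same factor."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [d / rate for d in durations_ms]
```

With even spacing the steps are 0.1 apart, so a factor of 1.0 (the standard rate) is one of the ten available settings.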

FIG. 3 is a constructional view of the text-to-speech
conversion system (TTS) according to the present invention. In
FIG. 3, the TTS consists of a multimedia information input unit
10, a data distributor by each media 11, a standardized language
processor 12, a prosody processor 13, a synchronization adjuster
14, a signal processor 15, a synthesis unit database 16, and a
picture output apparatus 17.

The multimedia information input unit 10 is configured in the
form of Tables 1 and 2 and comprises text, prosody information,
information on synchronization with a moving picture, and
information on lip-shape. Among these, only the text is
requisite; the other information can be provided by an
information provider as optional items for enhancing individual
property and naturalness and accomplishing synchronization with
the multimedia and, if needed, can be amended by a TTS user by
means of a character input device (keyboard) or a mouse. This
information is transmitted to the data distributor by each media
11.



The data distributor by each media 11 receives the
multimedia information, transmits the picture information to the
picture output apparatus 17, transmits the text to the language
processor 12, and converts the synchronization information into
a data structure usable by the synchronization adjuster 14 and
transmits it to the synchronization adjuster 14. If prosody
information is included in the inputted multimedia information,
it is converted into a data structure usable by the signal
processor 15 and then transmitted to the prosody processor 13
and the synchronization adjuster 14. If individual property
information is included in the inputted multimedia information,
it is converted into a data structure usable by the synthesis
unit database 16 and the prosody processor 13 within the TTS and
then transmitted to the synthesis unit database 16 and the
prosody processor 13.
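
The routing performed by the data distributor can be summarized as a lookup table. The sketch below is a simplified illustration: the dictionary keys, the destination labels, and the function `distribute` are hypothetical names, and the conversion of each item into a destination-specific data structure is left out.

```python
from typing import Any, Dict

# Destinations for each kind of input information, following the description.
ROUTES = {
    "picture": ["picture_output_apparatus"],
    "text": ["language_processor"],
    "synchronization": ["synchronization_adjuster"],
    "prosody": ["prosody_processor", "synchronization_adjuster"],
    "individual_property": ["synthesis_unit_database", "prosody_processor"],
}


def distribute(info: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Group the input items by destination; text is the only required item."""
    if "text" not in info:
        raise ValueError("text is requisite")
    outbox: Dict[str, Dict[str, Any]] = {}
    for kind, payload in info.items():
        for destination in ROUTES[kind]:   # raises KeyError for an unknown kind
            outbox.setdefault(destination, {})[kind] = payload
    return outbox
```

For example, an input carrying only text and prosody yields work for the language processor, the prosody processor, and the synchronization adjuster, and nothing for the picture output apparatus.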

The language processor 12 converts the text into a phoneme
stream, estimates prosody information, symbolizes this
information, and then transmits the symbolized information to
the prosody processor 13. The prosody symbols are estimated from
phrase and paragraph boundaries, the location of the accent in
each word, the sentence pattern, and so on, using the result of
syntactic analysis.



The prosody processor 13 takes the processing result of the
language processor 12 and calculates the values of the prosody
control parameters other than those included in the multimedia
information. The prosody control parameters include phoneme
duration, pitch contour, energy contour, pause point, and pause
length. The calculated result is transmitted to the
synchronization adjuster 14.
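
The rule that the prosody processor computes only the parameters not already supplied can be sketched as a merge. In this illustration the rule-and-table calculation is replaced by a precomputed dictionary, and the function name `fill_prosody` and the parameter names are hypothetical.

```python
from typing import Dict, Optional


def fill_prosody(supplied: Dict[str, Optional[float]],
                 rule_based: Dict[str, float]) -> Dict[str, float]:
    """Keep every supplied parameter; fall back to the rule-based value otherwise."""
    merged = {}
    for name, rule_value in rule_based.items():
        value = supplied.get(name)
        merged[name] = rule_value if value is None else value
    return merged
```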



The synchronization adjuster 14 takes the processing result
of the prosody processor 13 and adjusts the duration of each
phoneme in order to synchronize the result with the picture
signal. The adjustment uses the synchronization information
transmitted from the data distributor by each media 11. First, a
lip-shape is assigned to each phoneme according to its
articulation location and articulation manner; the assigned
lip-shapes are then compared with the lip-shapes included in the
synchronization information, and the phoneme stream is divided
into as many small groups as there are lip-shapes recorded in
the synchronization information. The durations of the phonemes
in each small group are then recalculated using the lip-shape
duration information included in the synchronization
information. The adjusted duration information is included in
the processing result of the prosody processor 13 and
transmitted to the signal processor 15.
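
The grouping and recalculation steps can be sketched as follows. The text does not spell out how assigned and recorded lip-shapes are matched, so the greedy rule used here (a new group starts at the first phoneme whose predicted shape equals the next recorded shape) and the proportional rescaling inside each group are assumptions, and both function names are hypothetical.

```python
from typing import List, Sequence


def group_by_lip_events(predicted: Sequence[str], events: Sequence[str]) -> List[List[int]]:
    """Split phoneme indices into one consecutive group per recorded lip-shape."""
    if not events:
        raise ValueError("at least one lip event is required")
    groups: List[List[int]] = [[] for _ in events]
    current = 0
    for index, shape in enumerate(predicted):
        nxt = current + 1
        # Advance to the next group when this phoneme matches the next recorded shape.
        if groups[current] and nxt < len(events) and shape == events[nxt]:
            current = nxt
        groups[current].append(index)
    return groups


def rescale_groups(durations_ms: Sequence[float], groups: List[List[int]],
                   event_durations_ms: Sequence[float]) -> List[float]:
    """Rescale the durations in each group so they sum to that lip event's duration."""
    adjusted = [float(d) for d in durations_ms]
    for group, target in zip(groups, event_durations_ms):
        total = sum(durations_ms[i] for i in group)
        if total > 0:
            for i in group:
                adjusted[i] = durations_ms[i] * target / total
    return adjusted
```

After this step each group of phonemes spans exactly the time during which the corresponding lip-shape is shown, while the relative lengths of the phonemes inside a group are preserved.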



The signal processor 15 receives the prosody information
from the data distributor by each media 11 or the processing
result of the synchronization adjuster 14 and produces and
outputs the synthesized speech using the synthesis unit database
16.

The synthesis unit database 16 receives the individual
property information from the data distributor by each media 11,
selects synthesis units suited to the gender and age, and then
transmits the data required for synthesis to the signal
processor 15 in response to a request from the signal processor
15.



As can be seen from the above description, according to the
present invention, the individual property of the synthesized
speech can be realized and the naturalness of the synthesized
speech can be enhanced by organizing the individual property and
prosody information estimated by analysis of actual speech data,
along with text information, as multistage information.
Furthermore, a foreign movie can be dubbed in Korean by
synchronizing the synthesized speech with the moving picture
through direct use, in producing the synthesized speech, of text
information and of lip-shape information estimated by analysis
of the actual speech data and the lip-shapes in the moving
picture. Still furthermore, by making synchronization between
the picture information and the TTS possible in a multimedia
environment, the present invention is applicable to a variety of
fields such as communication services, office automation,
education, and so on.

Although the present invention and its advantages have been 
described in detail, it should be understood that various 
changes, substitutions and alterations can be made herein without 
departing from the spirit and scope of the invention as defined 
by the appended claims.

It is therefore intended by the appended claims to cover any 
and all such applications, modifications, and embodiments within 
the scope of the present invention. 

WHAT IS CLAIMED IS:

1. A text-to-speech conversion system (TTS) for interlocking
with multimedia comprising:

a multimedia information input unit for organizing text,
prosody, the information on synchronization with moving picture,
lip-shape, and the information such as individual property;

a data distributor by media for distributing the information
of the multimedia information input unit into the information by
each media;

a language processor for converting the text distributed by
the data distributor by each media into phoneme stream, presuming
prosody information and symbolizing the information;

a prosody processor for calculating a value of prosody
control parameter from the symbolized prosody information using
a rule and a table;

a synchronization adjuster for adjusting the duration of the
phoneme using the synchronization information distributed by the
data distributor by each media;

a signal processor for producing a synthesized speech using
the prosody control parameter and data in a synthesis unit
database; and

a picture output apparatus for outputting the picture
information distributed by the data distributor by each media
onto a screen.

2. A method for organizing input data of a text-to-speech
conversion system (TTS) for interlocking with multimedia
comprising the steps of:

classifying multimedia input information organized for
enhancing the naturalness of synthesized speech and implementing
the synchronization of multimedia with TTS into text, prosody,
the information on synchronization with moving picture,
lip-shape, and individual property information in a multimedia
information input unit;

distributing the information classified in the multimedia
information input unit in a data distributor by each media, based
on respective information;

converting text distributed in the data distributor by each
media into phoneme stream, presuming prosody information and
symbolizing the information in a language processor;

calculating a value of prosody control parameter other than
prosody control parameter included in multimedia information in
a prosody processor;

adjusting the duration of each phoneme in a synchronization
adjuster so that processing result in the prosody processor may
be synchronized with a picture signal according to input of the
synchronization information;

producing the synchronized speech in a signal processor
using the prosody information from the data distributor by each
media, the processing result in the synchronization adjuster, and
a synthesis unit database; and

outputting the picture information distributed by the data
distributor by each media onto a screen in a picture output
apparatus.

3. The method according to claim 2, wherein said organized
multimedia information is comprised of text information, prosody
information, information synchronized with moving picture,
lip-shape and individuality information.

4. The method according to claim 3, wherein said prosody
information is comprised of the number of phoneme, phoneme stream
information, duration time of each phoneme, pitch pattern of the
phoneme and energy pattern of the phoneme.

5. The method according to claim 4, wherein said pitch pattern of
the phoneme is indicative of a value of pitch at beginning point,
middle point, and end point within the phoneme.

6. The method according to claim 4, wherein said energy pattern
of the phoneme is indicative of a value of energy in decibel at
beginning point, mid point and end point within phoneme.

7. The method according to claim 2, wherein said synchronization
information is comprised of text, lip-shape, location information
with moving picture, and the duration information.

8. The method according to claim 2, wherein said synchronization
information is composed of a beginning point, duration and delay
time information of starting point, and duration of each phoneme
is controlled by said synchronization information.

9. The method according to claim 2, wherein said synchronization
information is composed of a duration of the beginning point of
a sentence and a duration information of starting point, and
duration of each phoneme is controlled by forecast lip-shape
considered an articulation manner of the phoneme and articulation
control, lip-shape within the synchronization and duration
information composed of said synchronization information.

10. The method according to claim 2, wherein said synchronized
speech is produced by an information of beginning point and end
point of each phoneme related with speech signal and an
information of phoneme.

11. The method according to claim 2, wherein said synchronized
speech is produced by a numeralization of distance (extent of
opening) between upper lip and low lip, distance (extent of width)
between left and right end points of lip, and extent of
projecting of lip and the lip-shape quantized and normalized
pattern depended on articulation location and articulation manner
of the phoneme on the basis of pattern with high discriminative
property.

12. The method according to claim 2, wherein said transmission
method of multimedia information comprises the steps of:

converting a prosody information existing in the multimedia
information into a data structure usable in the signal
processor;

transmitting the converted prosody information to the
prosody processor and the synchronization adjuster;

converting the prosody information outputted from the prosody
processor and the synchronization adjuster to a data structure
usable in the synthesis unit database and the prosody processor
within the TTS if the prosody information is included in said
multimedia input information;

transmitting them to the synthesis unit database and the
prosody processor if the individual property information is
included in said multimedia input information.

ABSTRACT

This invention relates to a text-to-speech conversion system
(TTS) for interlocking with multimedia and a method for
organizing input data of the same. A conventional TTS is used
only for synthesizing speech from inputted text. In addition,
with a prior organization, it is impossible to estimate from the
text alone the information required when a moving picture is to
be dubbed by use of a TTS or when a natural interlock between
the synthesized speech and multimedia such as animation is to be
implemented, so there is no method of realizing these functions.
Furthermore, there are no results of studies on the use of
additional data for enhancing the naturalness of synthesized
speech or on the organization of such data. Therefore, an object
of the present invention is to provide a text-to-speech
conversion system (TTS) for interlocking with multimedia and a
method for organizing input data of the same that enhance the
naturalness of synthesized speech and accomplish synchronization
of multimedia with the TTS by defining additional prosody
information, the information required to interlock the TTS with
multimedia, and an interface between this information and the
TTS for use in producing the synthesized speech. According to
the present invention, a foreign movie can be dubbed in Korean
by synchronizing the synthesized speech with the moving picture
through direct use, in producing the synthesized speech, of text
information and of lip-shape information estimated by analysis
of the actual speech data and the lip-shapes in the moving
picture. Still furthermore, by making synchronization between
the picture information and the TTS possible in a multimedia
environment, the present invention is applicable to a variety of
fields such as communication services, office automation,
education, and so on.
In re Application of 

Jung Chul LEE et al. 

For: A Text-To-Speech Conversion System 
For Interlocking With Multimedia And A 
Method For Organizing Input Data Of 
The Same 



PRELIMINARY AMENDMENT 



Assistant Commissioner for Patents 
Washington, D.C. 20231 

SIR: 



Prior to examination of the above-identified application, please amend the 
application as follows: 

In the Specification: 

Page 8, delete the title "[Table 1]" and the table itself and substitute therefor the 
title --Table 1-- and the following table: 



Syntax 

TTS_Sequence() { 
    TTS_Sequence_Start_Code 
    TTS_Sentence_ID 
    Language_Code 
    Prosody_Enable 
    Video_Enable 
    Lip_Shape_Enable 
    Trick_Mode_Enable 
    do { 
        TTS_Sentence() 
    } while (next_bits() == TTS_Sentence_Start_Code) 
} 



Page 9, delete the title "[Table 2]" and the table itself and substitute therefor the 
title --Table 2-- and the following table: 

TTS_Sentence() { 
    TTS_Sentence_Start_Code 
    TTS_Sentence_ID 
    Silence 
    if (Silence) { 
        Silence_Duration 
    } 
    else { 
        Gender 
        Age 
        if (!Video_Enable) { 
            Speech_Rate 
        } 
        Length_of_Text 
        TTS_Text() 
        if (Prosody_Enable) { 
            Dur_Enable 
            F0_Contour_Enable 
            Energy_Contour_Enable 
            Number_of_Phonemes 
            for (j = 0; j < Number_of_Phonemes; j++) { 
                Symbol_each_phoneme 
                if (Dur_Enable) { 
                    Dur_each_phoneme 
                } 
                if (F0_Contour_Enable) { 
                    F0_contour_each_phoneme 
                } 
                if (Energy_Contour_Enable) { 
                    Energy_contour_each_phoneme 
                } 
            } 
        } 
        if (Video_Enable) { 
            Sentence_Duration 
            Position_in_Sentence 
            offset 
        } 
        if (Lip_Shape_Enable) { 
            Number_of_Lip_Event 
            for (j = 0; j < Number_of_Lip_Event; j++) { 
                Lip_in_Sentence 
                Lip_shape 
            } 
        } 
    } 
} 
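
The conditional presence rules in Tables 1 and 2 can be mirrored as a small data model. The following is a minimal sketch in Python and is not part of the application: the field types, the helper classes `Phoneme`, `LipEvent` and `SequenceFlags`, and the function `check_sentence` are assumptions introduced for illustration. No bit widths or start-code values are implied, since the tables above do not state them.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Phoneme:
    symbol: str                                 # Symbol_each_phoneme
    dur: Optional[int] = None                   # only if Dur_Enable
    f0_contour: Optional[Tuple] = None          # only if F0_Contour_Enable
    energy_contour: Optional[Tuple] = None      # only if Energy_Contour_Enable


@dataclass
class LipEvent:
    lip_in_sentence: int                        # Lip_in_Sentence
    lip_shape: int                              # Lip_shape


@dataclass
class SequenceFlags:                            # flags carried by TTS_Sequence()
    prosody_enable: bool = False
    video_enable: bool = False
    lip_shape_enable: bool = False


@dataclass
class Sentence:                                 # fields of TTS_Sentence()
    sentence_id: int
    silence: bool
    silence_duration: Optional[int] = None
    gender: Optional[int] = None
    age: Optional[int] = None
    speech_rate: Optional[int] = None
    text: Optional[str] = None
    phonemes: List[Phoneme] = field(default_factory=list)
    sentence_duration: Optional[int] = None
    position_in_sentence: Optional[int] = None
    offset: Optional[int] = None
    lip_events: List[LipEvent] = field(default_factory=list)


def check_sentence(s: Sentence, flags: SequenceFlags) -> List[str]:
    """Return the Table 2 presence rules that the sentence violates."""
    errors = []
    if s.silence:
        # A silent sentence carries only Silence_Duration.
        if s.silence_duration is None:
            errors.append("Silence set but Silence_Duration missing")
        return errors
    if s.text is None:
        errors.append("TTS_Text missing")
    # Speech_Rate is carried only when Video_Enable is off.
    if flags.video_enable and s.speech_rate is not None:
        errors.append("Speech_Rate present although Video_Enable is set")
    if not flags.video_enable and s.speech_rate is None:
        errors.append("Speech_Rate missing")
    if flags.video_enable and s.sentence_duration is None:
        errors.append("Sentence_Duration missing")
    if not flags.prosody_enable and s.phonemes:
        errors.append("phonemes present although Prosody_Enable is off")
    if not flags.lip_shape_enable and s.lip_events:
        errors.append("lip events present although Lip_Shape_Enable is off")
    return errors
```

For example, a sentence destined for dubbing (`video_enable=True`) omits `speech_rate` and supplies `sentence_duration` instead, because the speaking rate is then determined by the picture timing.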
In the Claims: 

Please cancel claims 1-12, without prejudice. 



Please add the following new claims: 
--13. A text-to-speech conversion system for interlocking with multimedia comprising: 

a multimedia information input unit for organizing text, prosody 
information, information on synchronization with a moving picture, lip-shape information, 
picture information, and individual property information; 

a data distributor by each media for distributing the information of said 
multimedia information input unit into information for each media; 

a language processor for converting the text distributed by said data 
distributor by each media into a phoneme stream, presuming prosody information and 
symbolizing the presumed prosody information; 

a prosody processor for calculating a prosody control parameter value from 
the symbolized prosody information; 

a synchronization adjustor for adjusting a duration of each phoneme using 
the synchronization information distributed by said data distributor by each media; 

a synthesis unit database for receiving the individual property information 
from said data distributor by each media, selecting synthesis units adaptable to gender and age, 
and outputting data required for synthesis; 

a signal processor for producing a synthesized speech using the prosody 
control parameter and the data output from said synthesis unit database; and 

a picture output apparatus for outputting the picture information distributed 
by said data distributor by each media onto a screen. 

14. A method for organizing input data of a text-to-speech conversion system for 
interlocking with multimedia, said method comprising the steps of: 

(a) classifying multimedia input information organized for enhancing natural 
synthesized speech and implementing synchronization of multimedia with text-to-speech into text, 
prosody information, information on synchronization with a moving picture, lip-shaped 
information, picture information, and individual property information using a multimedia 
information input unit; 

(b) distributing using a data distributor by each media the multimedia input 
information classified in the multimedia information input unit based on respective information; 

(c) converting the text distributed by the data distributor by each media into 
a phoneme stream, presuming prosody information and symbolizing the presumed prosody 
information using a language processor; 

(d) calculating a prosody control parameter value other than a prosody control 
parameter included in the multimedia input information using a prosody processor; 

(e) adjusting a duration of each phoneme using a synchronization adjuster so as 
to synchronize a processing result of the prosody processor with a picture signal according to 
the synchronization information distributed by the data distributor by each media; 

(f) selecting synthesis units adaptable to gender and age based on the 
individual property information from the data distributor by each media using a synthesis unit 
database and outputting data required for synthesis; 

(g) producing synthesized speech using a signal processor based on the 
prosody information distributed by the data distributor by each media, a processing result of the 
synchronization adjustor, and the data from the synthesis unit database; and 

(h) outputting the picture information distributed by the data distributor by 
each media onto a screen using a picture output unit. 

15. The method in accordance with claim 14, wherein the organized multimedia 
information comprises text information, prosody information, information on synchronization 
with a moving picture, lip-shaped information, and individual property information. 

16. The method in accordance with claim 15, wherein the prosody information 
comprises a number of phonemes, phoneme stream information, duration of each phoneme, pitch 
pattern of the phoneme, and energy pattern of the phoneme. 

17. The method in accordance with claim 16, wherein the duration time of the 
phoneme is indicative of a value of pitch at a beginning point, a mid point, and an end point 
within the phoneme. 

18. The method in accordance with claim 17, wherein the energy pattern of the 
phoneme is indicative of a value of energy in decibels at the beginning point, the mid point, and 
the end point within the phoneme. 

19. The method in accordance with claim 15, wherein the synchronization information 
comprises text, lip-shape, location information with a moving picture, and duration information. 

20. The method in accordance with claim 15, wherein the synchronization information 
comprises a beginning point, duration and delay time information of a starting point, and 
duration of each phoneme is controlled by the synchronization information. 

21. The method in accordance with claim 15, wherein the synchronization information 
is composed of a duration of a beginning point of a sentence, a duration information of a starting 
point, and duration of each phoneme is controlled by forecast lip-shape considered an articulation 
manner of the phoneme and articulation control of lip-shape within the synchronization and 
duration information of the synchronization information. 

22. The method in accordance with claim 15, wherein the synthesized speech is 
produced based on beginning point information, end point information, and phoneme information 
for each phoneme within an interval associated with a speech signal. 

23. The method in accordance with claim 15, wherein the synthesized speech is 
produced based on a distance of an opening between an upper lip and a lower lip, a distance 
between end points of the lips, and an extent of projection of a lip, and a lip-shape quantized 
and normalized pattern is defined depending on articulation location and articulation manner of 
the phoneme on a basis of pattern with discriminative property. 

24. The method in accordance with claim 15, wherein if the multimedia input 
information comprises prosody information, further comprising the steps of: 

(i) converting the prosody information into a data structure recognizable by the 
signal processor; and 

(j) transmitting the converted prosody information to the prosody processor and the 
synchronization adjustor. 

25. The method in accordance with claim 15, wherein if the multimedia input 
information includes individual property information, further comprising the steps of: 

(k) converting the individual property information into a data structure 
recognizable by the synthesis unit database and the prosody processor within the text-to-speech; 

(l) transmitting the converted individual property information to the synthesis unit 
database and the prosody processor.-- 
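
Step (e) of claim 14 and claim 20 state that phoneme durations are controlled by the synchronization information (beginning point, duration, and delay time). A minimal sketch of one way to do this in Python follows. It is illustrative only: uniform proportional scaling, millisecond units, and the function names `fit_durations` and `start_times` are assumptions, not the method disclosed in the application.

```python
from typing import List


def fit_durations(durations_ms: List[float],
                  sentence_duration_ms: float,
                  offset_ms: float = 0.0) -> List[float]:
    """Scale phoneme durations to fill the interval left after the delay.

    Uniform scaling is an assumption of this sketch; the claims only say
    that durations are controlled by the synchronization information.
    """
    target = sentence_duration_ms - offset_ms
    total = sum(durations_ms)
    if target <= 0 or total <= 0:
        raise ValueError("target interval and total duration must be positive")
    scale = target / total
    return [d * scale for d in durations_ms]


def start_times(durations_ms: List[float],
                begin_ms: float,
                offset_ms: float = 0.0) -> List[float]:
    """Beginning point of each phoneme on the picture timeline."""
    times = []
    t = begin_ms + offset_ms  # speech starts after the delay
    for d in durations_ms:
        times.append(t)
        t += d
    return times
```

For instance, three phonemes estimated at 100, 200 and 100 ms, placed in a 900 ms sentence slot with a 100 ms delay, are each stretched by a factor of two so that the speech ends with the picture segment.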

REMARKS 

This preliminary amendment is presented to place the claims in better condition 
for examination. No new matter has been added. Early examination and favorable consideration 
of the above-identified application is earnestly solicited. 

Any additional fees or charges required at this time in connection with the 
application may be charged to our Patent and Trademark Office Deposit Account No. 03-2412. 



Respectfully submitted, 

COHEN, PONTANI, LIEBERMAN & PAVANE 

By 
Martin B. Pavane 
Reg. No. 28,337 
551 Fifth Avenue, Suite 1210 
New York, N.Y. 10176 
(212) 687-2770 

February 9, 1998 
VERIFIED STATEMENT (DECLARATION) CLAIMING SMALL ENTITY 
STATUS (37 CFR 1.9(f) and 1.27(b)) - SMALL BUSINESS CONCERN 

Applicant or Patentee: 

Attorney's Docket No.: 

Serial or Patent No.: 

Filed or Issued: 

Title: A TEXT-TO-SPEECH CONVERSION SYSTEM FOR INTERLOCKING WITH 
MULTIMEDIA AND A METHOD FOR ORGANIZING INPUT DATA OF THE SAME 

I hereby declare that I am 

[ ] the owner of the small business concern identified below: 

[ ] an official of the small business concern empowered to act on behalf of the concern identified below: 

NAME OF SMALL BUSINESS CONCERN: 

ADDRESS OF CONCERN: 



I hereby declare that the above identified small business concern qualifies as a small business concern as 
defined in 13 CFR 121.12, and reproduced in 37 CFR 1.9(d), for purposes of paying reduced fees to the 
United States Patent and Trademark Office in that the number of employees of the concern, including those 
of its affiliates, does not exceed 500 persons. For purposes of this statement, (1) the number of employees 
of the business concern is the average over the previous fiscal year of the concern of the persons employed 
on a full-time, part-time or temporary basis during each of the pay periods of the fiscal year, and (2) 
concerns are affiliates of each other when either, directly or indirectly, one concern controls or has the 
power to control the other, or a third party or parties controls or has the power to control both. 

I hereby declare that rights under contract or law have been conveyed to and remain with the small 
business concern identified above with regard to the invention described in: 

[ ] the specification filed herewith with title listed above. 

[ ] the application identified above, 

[ ] the patent identified above. 

If the rights held by the above identified small business concern are not exclusive, each individual, concern 
or organization having rights to the invention must file separate verified statements averring to their status 
as small entities, and no rights to the invention are held by any person, other than the inventor, who would 
not qualify as an independent inventor under 37 CFR 1.9(c) if that person made the invention or by any 
concern which would not qualify as a small business concern under 37 CFR 1.9(d), or a nonprofit 
organization under 37 CFR 1.9(e). 

Each person, concern or organization having any rights in the invention is listed below: 

[ ] no such person, concern, or organization exists. 

[ ] each such person, concern or organization listed below 

FULL NAME: ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE 
ADDRESS: 161 Kajong-Dong, Yusong-Gu, Daejon-Shi, Korea 

[ ] INDIVIDUAL [ ] SMALL BUSINESS CONCERN [ ] NONPROFIT ORGANIZATION 

Separate verified statements are required from each named person, concern or organization having rights to 
the invention averring to their status as small entities. (37 CFR 1.27) 

I acknowledge the duty to file, in this application or patent, notification of any change in status resulting in 
loss of entitlement to small entity status prior to paying, or at the time of paying, the earliest of the issue 
fee or any maintenance fee due after the date on which status as a small entity is no longer appropriate. 
(37 CFR 1.28(b)) 

I hereby declare that all statements made herein of my own knowledge are true and that all statements 
made on information and belief are believed to be true; and further that these statements were made with 
the knowledge that willful false statements and the like so made are punishable by fine or imprisonment, or 
both, under section 1001 of Title 18 of the United States Code, and that such willful false statements may 
jeopardize the validity of the application, any patent issuing thereon, or any patent to which this verified 
statement is directed. 

NAME OF PERSON SIGNING Keun Tang Song 

TITLE OF PERSON OTHER THAN OWNER Head of Intellectual Property Section 

ADDRESS OF PERSON SIGNING 161 Kajong-Dong, Yusong-Gu, Daejon-Shi, Korea 

SIGNATURE [signature] DATE 20/10/1997 
Includes Reference to PCT International Applications 

Attorney's Docket No. 

As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name. 

I believe I am the original, first and sole inventor (if only one name is listed below) or an original, 
first and joint inventor (if plural names are listed below) of the subject matter which is claimed and 
for which a patent is sought on the invention entitled: 

A TEXT-TO-SPEECH CONVERSION SYSTEM FOR INTERLOCKING WITH MULTIMEDIA 
AND A METHOD FOR ORGANIZING INPUT DATA OF THE SAME 

the specification of which (check only one item below) 

[ ] is attached hereto 

[ ] was filed as United States application 
    Serial No.                on 
    and was amended           on                (if applicable). 

[ ] was filed as PCT international application 
    Number                    on 
    and was amended under PCT Article 19 
                              on                (if applicable). 

I hereby state that I have reviewed and understand the contents of the above-identified specification, 
including the claims, as amended by any amendment referred to above. 

I acknowledge the duty to disclose information which is material to the patentability of the application in 
accordance with Title 37, Code of Federal Regulations, § 1.56(a). 

I hereby claim foreign priority benefits under Title 35, United States Code, § 119 of any foreign 
application(s) for patent or inventor's certificate or of any PCT international application(s) designating at 
least one country other than the United States of America listed below and have also identified below any 
foreign application(s) for patent or inventor's certificate or any PCT international application(s) 
designating at least one country other than the United States of America filed by me on the same subject 
matter having a filing date before that of the application(s) of which priority is claimed. 


PRIOR FOREIGN/PCT APPLICATIONS AND ANY PRIORITY CLAIMS UNDER 35 U.S.C. 119: 

Country                    Application Number   Date of Filing       Priority Claimed 
(if PCT, indicate "PCT")                        (day, month, year)   Under 35 U.S.C. 119 

Korea                      97-17615             08/05/1997           [X] YES   [ ] NO 
                                                                     [ ] YES   [ ] NO 
                                                                     [ ] YES   [ ] NO 
                                                                     [ ] YES   [ ] NO 
                                                                     [ ] YES   [ ] NO 

I hereby claim the benefit under Title 35, United States Code, § 120 of any United States application(s) or PCT 
international application(s) designating the United States of America that is/are listed below and, insofar as the 
subject matter of each of the claims of this application is not disclosed in that/those prior application(s) in the 
manner provided by the first paragraph of Title 35, United States Code, § 112, I acknowledge the duty to 
disclose material information as defined in Title 37, Code of Federal Regulations, § 1.56(a) which occurred 
between the filing date of the prior application(s) and the national or PCT international filing date of this 
application: 

PRIOR U.S. APPLICATIONS OR PCT INTERNATIONAL APPLICATIONS DESIGNATING THE U.S. FOR 
BENEFIT UNDER 35 U.S.C. 120: 

U.S. APPLICATIONS                              STATUS (check one) 
U.S. APPLICATION NUMBER   U.S. FILING DATE     PATENTED   PENDING   ABANDONED 

PCT APPLICATION 

agent(s) to prosecute this application and transact all business in the Patent and Trademark Office 
connected therewith (List name and registration number) 

MYRON COHEN, Reg. No. 17,358; THOMAS C. PONTANI, Reg. No. 29,763; LANCE J. LIEBERMAN, 
Reg. No. 28,437; MARTIN B. PAVANE, Reg. No. 28,337; MICHAEL C. STUART, Reg. No. 35,698; 
JAMES J. DeCARLO, Reg. No. 36,120; CAROL E. ROZEK, Reg. No. 36,993; EDWARD M. WEISZ, Reg. 
No. 37,257; KLAUS P. STOFFEL, Reg. No. 31,668; CHI K. ENG, Reg. No. 38,870; EDWARD ETKIN, Reg. 
No. 37,824; CHERYL COHEN, Reg. No. 40,361; and JULIA S. KIM, Reg. No. 36,567. 

Send correspondence to: 

Martin B. Pavane, Esq. 
Reg. No. 28,337 
Cohen, Pontani, Lieberman & Pavane 
551 Fifth Avenue, Suite 1210 
New York, New York 10176 

Direct Telephone calls to: 
Martin B. Pavane, Esq. 
(212) 687-2770 